93 research outputs found
A Decomposition Theorem for Maximum Weight Bipartite Matchings
Let G be a bipartite graph with positive integer weights on the edges and
without isolated nodes. Let n, N and W be the node count, the largest edge
weight and the total weight of G. Let k(x,y) be log(x)/log(x^2/y). We present a
new decomposition theorem for maximum weight bipartite matchings and use it to
design an O(sqrt(n)W/k(n,W/N))-time algorithm for computing a maximum weight
matching of G. This algorithm bridges a long-standing gap between the best
known time complexity of computing a maximum weight matching and that of
computing a maximum cardinality matching. Given G and a maximum weight matching
of G, we can further compute the weight of a maximum weight matching of G-{u}
for all nodes u in O(W) time. Comment: The journal version will appear in SIAM Journal on Computing. The
conference version appeared in ESA 199
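To make the stated running time easier to read, the definition of k can be substituted into the bound. The expansion below uses only the symbols defined in the abstract. The last line is an illustrative special case that assumes every edge has weight 1, so that N = 1 and W equals the number of edges, written m here (m is not a symbol from the abstract).

```latex
k(n, W/N) = \frac{\log n}{\log\bigl(n^2 / (W/N)\bigr)} = \frac{\log n}{\log(n^2 N / W)}

O\!\left(\frac{\sqrt{n}\,W}{k(n, W/N)}\right)
  = O\!\left(\frac{\sqrt{n}\,W \log(n^2 N / W)}{\log n}\right)

% Assumed special case: unit weights, so N = 1 and W = m
O\!\left(\frac{\sqrt{n}\,m \log(n^2 / m)}{\log n}\right)
```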
An Even Faster and More Unifying Algorithm for Comparing Trees via Unbalanced Bipartite Matchings
A widely used method for determining the similarity of two labeled trees is
to compute a maximum agreement subtree of the two trees. Previous work on this
similarity measure is only concerned with the comparison of labeled trees of
two special kinds, namely, uniformly labeled trees (i.e., trees with all their
nodes labeled by the same symbol) and evolutionary trees (i.e., leaf-labeled
trees with distinct symbols for distinct leaves). This paper presents an
algorithm for comparing trees that are labeled in an arbitrary manner. In
addition to this generality, this algorithm is faster than the previous
algorithms.
Another contribution of this paper is on maximum weight bipartite matchings.
We show how to speed up the best known matching algorithms when the input
graphs are node-unbalanced or weight-unbalanced. Based on these enhancements,
we obtain an efficient algorithm for a new matching problem called the
hierarchical bipartite matching problem, which is at the core of our maximum
agreement subtree algorithm. Comment: To appear in Journal of Algorithm
Continuous Monitoring of Distributed Data Streams over a Time-based Sliding Window
The past decade has witnessed many interesting algorithms for maintaining
statistics over a data stream. This paper initiates a theoretical study of
algorithms for monitoring distributed data streams over a time-based sliding
window (which contains a variable number of items and possibly out-of-order
items). The concern is how to minimize the communication between individual
streams and the root, while allowing the root, at any time, to be able to
report the global statistics of all streams within a given error bound. This
paper presents communication-efficient algorithms for three classical
statistics, namely, basic counting, frequent items and quantiles. The
worst-case communication cost over a window is bits for basic counting and words for the remainings, where is the number of distributed
data streams, is the total number of items in the streams that arrive or
expire in the window, and is the desired error bound. Matching
and nearly matching lower bounds are also obtained. Comment: 12 pages, to appear in the 27th International Symposium on
Theoretical Aspects of Computer Science (STACS), 201
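The communication trade-off described above can be illustrated with a deliberately simple sketch. This is not the algorithm from the paper: it is a generic "report only on sufficient drift" scheme for basic counting, it assumes timestamps arrive in order (the paper also allows out-of-order items), and the names Site, advance, window and eps are hypothetical.

```python
from collections import deque


class Site:
    """One stream that messages the root only when its windowed count drifts."""

    def __init__(self, window, eps):
        self.window = window    # window length in time units
        self.eps = eps          # tolerated relative error
        self.times = deque()    # timestamps of items still inside the window
        self.reported = 0       # last count sent to the root
        self.messages = 0       # number of messages sent so far

    def advance(self, now, arrivals=0):
        """Move the clock to `now`, adding `arrivals` items stamped `now`."""
        self.times.extend([now] * arrivals)
        # Drop items that have fallen out of the window (now - window, now].
        while self.times and self.times[0] <= now - self.window:
            self.times.popleft()
        count = len(self.times)
        # Send a message only if the root's copy is off by more than eps * count.
        if abs(count - self.reported) > self.eps * count:
            self.reported = count
            self.messages += 1
        return count


site = Site(window=10, eps=0.5)
for t in range(1, 21):
    true_count = site.advance(t, arrivals=1)
print(true_count, site.reported, site.messages)
```

After every call to advance, the value held by the root differs from the true windowed count by at most eps times that count, while the site sends far fewer messages than it receives items. A root summing the reported values of several such sites inherits the same relative error, provided every site has been advanced to the current time.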
Cavity Matchings, Label Compressions, and Unrooted Evolutionary Trees
We present an algorithm for computing a maximum agreement subtree of two
unrooted evolutionary trees. It takes O(n^{1.5} log n) time for trees with
unbounded degrees, matching the best known time complexity for the rooted case.
Our algorithm allows the input trees to be mixed trees, i.e., trees that may
contain directed and undirected edges at the same time. Our algorithm adopts a
recursive strategy exploiting a technique called label compression. The
backbone of this technique is an algorithm that computes the maximum weight
matchings over many subgraphs of a bipartite graph as fast as it takes to
compute a single matching.
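To make concrete what "maximum weight matchings over many subgraphs" means, here is a brute-force reference sketch that computes the maximum matching weight of a small bipartite graph G and of G-{u} for each node u. It only illustrates the quantity being computed and takes exponential time; it is not the algorithm of any paper listed here. The function names and the toy graph are hypothetical.

```python
def max_matching_weight(left, weights, removed=frozenset()):
    """Maximum total weight of a matching, ignoring the nodes in `removed`.

    left    -- list of left-side nodes
    weights -- dict mapping (left_node, right_node) to a positive weight
    """
    active = [u for u in left if u not in removed]

    def best(i, used):
        if i == len(active):
            return 0
        u = active[i]
        result = best(i + 1, used)  # option: leave u unmatched
        for (a, b), w in weights.items():
            if a == u and b not in used and b not in removed:
                result = max(result, w + best(i + 1, used | {b}))
        return result

    return best(0, frozenset())


def weights_without_each_node(left, weights):
    """Maximum matching weight of G - {u} for every node u of G."""
    nodes = set(left) | {b for (_, b) in weights}
    return {u: max_matching_weight(left, weights, frozenset({u})) for u in nodes}


LEFT = ["a", "b"]
WEIGHTS = {("a", "x"): 3, ("a", "y"): 2, ("b", "x"): 4}
print(max_matching_weight(LEFT, WEIGHTS))        # 6, from a-y and b-x
print(weights_without_each_node(LEFT, WEIGHTS))
```

Each call recomputes a matching from scratch, which is the cost that the results described above avoid.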
BASE: a practical de novo assembler for large genomes using long NGS reads
© 2016 The Author(s). Background: De novo genome assembly using NGS data remains a computation-intensive task, especially for large genomes. In practice, efficiency is often a primary concern and favors using a more efficient assembler like SOAPdenovo2. Yet SOAPdenovo2, based on the de Bruijn graph, fails to take full advantage of longer NGS reads (say, 150 bp to 250 bp from Illumina HiSeq and MiSeq). Assemblers that are based on string graphs (e.g., SGA), though less popular and also very slow, are more favorable for longer reads. Methods: This paper presents a new de novo assembler called BASE. It enhances the classic seed-extension approach by indexing the reads efficiently to generate adaptive seeds that have a high probability of appearing uniquely in the genome. Such seeds form the basis for BASE to build extension trees and then to use reverse validation to remove branches based on read coverage and paired-end information, resulting in high-quality consensus sequences of reads sharing the seeds. Such consensus sequences are then extended to contigs. Results: Experiments on two bacteria and four human datasets show the advantage of BASE in both contig quality and speed in dealing with longer reads. In the experiment on bacteria, two datasets with read lengths of 100 bp and 250 bp were used. Especially for the 250 bp dataset, BASE gives much better quality than SOAPdenovo2 and SGA and is similar to SPAdes. Regarding speed, BASE is consistently a few times faster than SPAdes and SGA, but still slower than SOAPdenovo2. BASE and SOAPdenovo2 are further compared using human datasets with read lengths of 100 bp, 150 bp and 250 bp. BASE shows a higher N50 for all datasets, and the improvement becomes more significant when the read length reaches 250 bp. Besides, BASE is more memory-efficient than SOAPdenovo2 when the sequencing data contain errors. Conclusions: BASE is a practically efficient tool for constructing contigs, with significant improvement in quality for long NGS reads.
It is relatively easy to extend BASE to include scaffolding.
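The N50 statistic used in the comparison above can be computed as follows. This is a minimal sketch of the usual definition (the largest length L such that contigs of length at least L cover at least half of the total assembled length). The function name n50 and the example lengths are illustrative and do not come from the paper.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths, or 0 for an empty list."""
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


example = [2, 3, 4, 5, 6, 7, 8, 9, 10]   # total length 54
print(n50(example))                       # 8, since 10 + 9 + 8 = 27 >= 54 / 2
```

A higher N50 means that more of the assembly sits in long contigs, which is why it serves as a contiguity measure when assemblers are compared on the same dataset.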
MegaGTA: A sensitive and accurate metagenomic gene-targeted assembler using iterative de Bruijn graphs
© 2017 The Author(s). Background: The recent release of the gene-targeted metagenomics assembler Xander has demonstrated that using a trained Hidden Markov Model (HMM) to guide the traversal of a de Bruijn graph gives an obvious advantage over other assembly methods. Xander, as a pilot study, has a lot of room for improvement. Apart from its slow speed, Xander uses only one k-mer size for graph construction, and whatever choice of k will compromise either sensitivity or accuracy. Xander uses a Bloom-filter representation of the de Bruijn graph to achieve a lower memory footprint. Bloom filters bring in false positives, and it is not clear how this would impact the quality of assembly. Xander does not keep track of the multiplicity of k-mers, which would have been an effective way to differentiate between erroneous k-mers and correct k-mers. Results: In this paper, we present a new gene-targeted assembler, MegaGTA, which attempts to improve on Xander in different aspects. Quality-wise, it utilizes iterative de Bruijn graphs to take full advantage of multiple k-mer sizes to make the best of both sensitivity and accuracy. Computation-wise, it employs succinct de Bruijn graphs (SdBG) to achieve a low memory footprint and high speed (the latter benefits from a highly efficient parallel algorithm for constructing SdBG). Unlike Bloom filters, an SdBG is an exact representation of a de Bruijn graph. It enables MegaGTA to avoid false-positive contigs and to easily incorporate the multiplicity of k-mers for building a better HMM model. We have compared MegaGTA and Xander on an HMP-defined mock metagenomic dataset, and showed that MegaGTA excelled in both sensitivity and accuracy. On a large rhizosphere soil metagenomic sample (327Gbp), MegaGTA produced 9.7-19.3% more contigs than Xander, and these contigs were assigned to 10-25% more gene references. In our experiments, MegaGTA, depending on the number of k-mers used, is two to ten times faster than Xander.
Conclusion: MegaGTA improves on the algorithm of Xander and achieves higher sensitivity, accuracy and speed. Moreover, it is capable of assembling gene sequences from ultra-large metagenomic datasets. Its source code is freely available at https://github.com/HKU-BAL/megagta.
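The role of k-mer multiplicity mentioned above can be shown with a small exact de Bruijn graph built from counted k-mers. This is a plain dictionary-based sketch, not MegaGTA's succinct representation or its iterative multi-k procedure. The function names, the min_multiplicity parameter and the toy reads are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def count_kmers(reads, k):
    """Count how many times each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def de_bruijn_edges(counts, min_multiplicity=1):
    """Map each (k-1)-mer to its successors, keeping only frequent k-mers."""
    graph = defaultdict(set)
    for kmer, multiplicity in counts.items():
        if multiplicity >= min_multiplicity:
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Hypothetical reads: the third one differs from the others by one base.
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
counts = count_kmers(reads, k=3)
unfiltered = de_bruijn_edges(counts)
filtered = de_bruijn_edges(counts, min_multiplicity=2)
print(sorted(unfiltered["GT"]))   # ['TA', 'TT']: the variant creates a branch
print(sorted(filtered["GT"]))     # ['TA']: the low-multiplicity branch is dropped
```

Because the counts are exact, a k-mer seen only once can be told apart from one seen many times, which a membership-only structure such as a Bloom filter cannot do.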
SOAP3-dp: Fast, Accurate and Sensitive GPU-Based Short Read Aligner
To tackle the exponentially increasing throughput of Next-Generation Sequencing (NGS), most of the existing short-read aligners can be configured to favor speed at the expense of accuracy and sensitivity. SOAP3-dp, through leveraging the computational power of both CPU and GPU with optimized algorithms, delivers high speed and sensitivity simultaneously. Compared with widely adopted aligners including BWA, Bowtie2, SeqAlto, CUSHAW2, GEM and GPU-based aligners BarraCUDA and CUSHAW, SOAP3-dp was found to be two to tens of times faster, while maintaining the highest sensitivity and lowest false discovery rate (FDR) on Illumina reads with different lengths. Transcending its predecessor SOAP3, which does not allow gapped alignment, SOAP3-dp by default tolerates alignment similarity as low as 60%. Real data evaluation using human genome demonstrates SOAP3-dp's power to enable more authentic variants and longer Indels to be discovered. Fosmid sequencing shows a 9.1% FDR on newly discovered deletions. SOAP3-dp natively supports BAM file format and provides the same scoring scheme as BWA, which enables it to be integrated into existing analysis pipelines. SOAP3-dp has been deployed on Amazon-EC2, NIH-Biowulf and Tianhe-1A.
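The two quality measures named above have standard definitions, sketched here for reference. The abstract does not say exactly how true and false calls were counted in its evaluation, so the function names and the counts in the example are purely illustrative.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of real items that were found: TP / (TP + FN)."""
    return true_positives / (true_positives + false_negatives)


def false_discovery_rate(true_positives, false_positives):
    """Fraction of reported items that were wrong: FP / (TP + FP)."""
    return false_positives / (true_positives + false_positives)


# Hypothetical counts, not taken from the paper.
tp, fp, fn = 90, 10, 30
print(sensitivity(tp, fn))            # 0.75
print(false_discovery_rate(tp, fp))   # 0.1
```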
SOAP3-dp: Fast, Accurate and Sensitive GPU-based Short Read Aligner
To tackle the exponentially increasing throughput of Next-Generation
Sequencing (NGS), most of the existing short-read aligners can be configured to
favor speed at the expense of accuracy and sensitivity. SOAP3-dp, through leveraging
the computational power of both CPU and GPU with optimized algorithms, delivers
high speed and sensitivity simultaneously. Compared with widely adopted
aligners including BWA, Bowtie2, SeqAlto, GEM and GPU-based aligners including
BarraCUDA and CUSHAW, SOAP3-dp is two to tens of times faster, while
maintaining the highest sensitivity and lowest false discovery rate (FDR) on
Illumina reads with different lengths. Transcending its predecessor SOAP3,
which does not allow gapped alignment, SOAP3-dp by default tolerates alignment
similarity as low as 60 percent. Real data evaluation using human genome
demonstrates SOAP3-dp's power to enable more authentic variants and longer
Indels to be discovered. Fosmid sequencing shows a 9.1 percent FDR on newly
discovered deletions. SOAP3-dp natively supports BAM file format and provides the
same scoring scheme as BWA, which enables it to be integrated into existing
analysis pipelines. SOAP3-dp has been deployed on Amazon-EC2, NIH-Biowulf and
Tianhe-1A. Comment: 21 pages, 6 figures, submitted to PLoS ONE, additional files
available at "https://www.dropbox.com/sh/bhclhxpoiubh371/O5CO_CkXQE".
Comments most welcom
Food-sorting jet arrays and target impact properties
This thesis uses numerical techniques and analysis to study the development and interactions between multiple in-line slender air jets. Consideration is given to two- and three-dimensional flow regimes, but the emphasis is on the latter. The applications (and mechanisms) involved in high-speed machine sorting of small food items, such as grains of rice, are explained. The underpinning mathematics required to develop the mathematical model are stated. In Chapter 2 an analytical solution for the two-dimensional steady jet is demonstrated and used to provide a far-downstream asymptote for validation of the numerical scheme, for steady and unsteady jets. A numerical scheme is demonstrated to be versatile and reasonably accurate. Small-distance analysis complements the numerical scheme and limitations are discussed. A comprehensive small-time analysis is undertaken, results from which support later work on three-dimensional jets. Interference between in-line jets is considered in Chapter 3, which applies methods previously used to study two-dimensional in-parallel wakes. The conclusions from this chapter support and help explain results in later chapters. The numerical scheme is extended to three-dimensional steady and unsteady jets. Issuing nozzles of various cross-sections are considered with the aim of obtaining pressure data for comparison with physical data. Small-distance analysis is again investigated, enabling a weakness in the numerical solution to be highlighted. Potential flow theory is used to model interference aspects of multiple in-line unsteady three-dimensional jets. The emphasis is placed on jets from nozzles of either circular or rectangular cross-section but, in fact, the analysis applies for any cross-section. The impact properties of a typical jet when it hits one of the particles being sorted, such as a grain of rice, are discussed briefly, and final comments are made.